Welcome to R!

Purpose of this tutorial

R is a great resource with extremely good documentation, including several free books that systematically teach users the basics and then some. This tutorial is not meant to replace any of those resources but to give you an opportunity to practice your skills in R and RStudio by building bar charts. If you are interested in learning more R, there will be a recommended book list at the end.

Skill Level

It is recommended that you have used R and the Tidyverse before. If you are new to R or the Tidyverse, we’d recommend starting with the UFOs Tutorial.

Proper use of tutorial

This tutorial is meant to test your skills and knowledge of R and the Tidyverse. While the code for each graph is available, it will be hidden to begin with. Try to recreate each graph using just the information given on it before looking at the original code.

Preparing your environment

If you have already set up an RStudio Cloud account or are using RStudio, you can skip to Load your libraries.

R Studio Cloud

To complete this tutorial, you will need an RStudio Cloud account. Navigate to the RStudio Cloud homepage and click Get Started.

Once you’ve made your account, you will be taken to Your Workspace. Click the blue New Project button to the right of Your Projects and wait for the Deploying Project bubbles to disappear.

R Script

Under File, highlight New File and select R Script.

Go ahead and save the file. This is where we will write and run all of our commands.

Set up the environment

Open Tools and click on Global Options. There will be four sections: R Sessions, Workspace, History and Other. Under Workspace, uncheck Restore .RData into workspace at startup and change Save workspace to .RData on exit to Never. Click OK to save and exit.

By changing these options, we are making it a little harder to pick up where you left off in a project after you close it, but this will make your code reproducible, allowing you to get the same results every time you run your code.

All of the functions we will be using are well documented. If you ever have questions about one, run it in the console (lower left pane) with a ? in front, e.g. ?help()

R and Data

Load your libraries

This tutorial will be using the tidyverse, ggstance and ggthemes. If you don’t have a library installed, you can install it with a console command (e.g. install.packages("tidyverse")).

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggstance)
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
library(ggthemes)

You only need to install a package once. After that, you can load it in any session with a library() call.
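
If you are not sure which of the three packages are already installed, a small check can save some time. The snippet below is only a sketch: missing_packages() is a hypothetical helper name made up for this example, not a function from any package, and the install step is guarded so that it only runs in an interactive session.

```r
# Hypothetical helper (not from any package): return the packages in `pkgs`
# that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "ggstance", "ggthemes"))

# Install only what is missing, and only when running interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, the library() calls above should load without an error.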

Loading data

We will be working with a semi-cleaned version of a UFO data set from Kaggle. The original data set can be found here. We will only be looking at data for the US. You can get the zipped file here: ufos.zip. RStudio Cloud only takes data in a zipped format, so we will upload the data set into our environment as a zip file.

Under the Files tab in the lower right pane, click the Upload button. Find the zipped file called ufos on your computer and upload it. You should see ufos.csv appear in your files.

Read in the csv as a tibble, which is a slightly more flexible version of a data frame, and store it in a variable that we will use for the rest of the lab. read_csv() already returns a tibble, so no extra conversion is needed.

ufos <- read_csv("ufos.csv")

read_csv() will print a message listing the column types it guessed for each column. You can ignore it.
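
To see what read_csv() hands back without touching the real file, here is a minimal sketch that reads a couple of made-up rows from a string. The object name toy and its rows are assumptions for illustration only; they stand in for ufos.csv.

```r
library(readr)

# Toy stand-in for ufos.csv (illustrative rows, not the real file).
# A string that contains newlines is read as literal CSV text.
toy <- read_csv("city,state,year\nsan marcos,tx,1949\nedna,tx,1956\n")

# read_csv() returns a tibble, and the year column is parsed as a number.
tibble::is_tibble(toy)
is.numeric(toy$year)
```

The same column-type message you saw for ufos.csv is printed here too, just for three columns.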

You should see ufos show up as a data set in your Global Environment in the upper right pane of your screen. If you click on it, the data will open in a new tab.

Look at our data

Let’s take a quick look at our data using glimpse(), which will give us a compressed view of the data. Take note of the categorical and numerical types of entries; we will be using most of these later.

glimpse(ufos)
## Observations: 65,114
## Variables: 14
## $ datetime             <dttm> 1949-10-10 20:30:00, 1956-10-10 21:00:00...
## $ city                 <chr> "san marcos", "edna", "kaneohe", "bristol...
## $ state                <chr> "tx", "tx", "hi", "tn", "ct", "al", "fl",...
## $ country              <chr> "us", "us", "us", "us", "us", "us", "us",...
## $ shape                <chr> "cylinder", "circle", "light", "sphere", ...
## $ duration..seconds.   <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, ...
## $ duration..hours.min. <chr> "45 minutes", "1/2 hour", "15 minutes", "...
## $ comments             <chr> "This event took place in early fall arou...
## $ date.posted          <date> 2004-04-27, 2004-01-17, 2004-01-22, 2007...
## $ latitude             <dbl> 29.88306, 28.97833, 21.41806, 36.59500, 4...
## $ longitude            <dbl> -97.94111, -96.64583, -157.80361, -82.188...
## $ year                 <dbl> 1949, 1956, 1960, 1961, 1965, 1966, 1966,...
## $ month                <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ day                  <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...


Time series

Being able to work with time is an important skill when it comes to data. It also tends to work well with bar charts. The plots below will start with some basic graphs over time and move into specific filtering. These plots will be using some styling elements, and each will have a theme.

Observations by year

ufos %>% 
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations by year",
    caption = "Theme: theme_clean"
  )

Hint: Look at the names of the variables on the x and y axes. How can we get our bar chart to have those names?

Want to add labels to your graphs? Use labs() with title, subtitle, caption, x and y.
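
As a rough sketch of how labs() fits into a plot, the example below builds a bar chart from a tiny made-up data frame and sets all five labels. The data and label text are assumptions for illustration and have nothing to do with the real ufos data.

```r
library(ggplot2)

# Toy data (made up for illustration): one row per sighting.
toy <- data.frame(year = c(1999, 1999, 2000, 2001, 2001, 2001))

p <- ggplot(toy, aes(year)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Toy sightings",
    subtitle = "Observations by year",
    caption = "Made-up data",
    x = "Year",
    y = "Number of sightings"
  )

# Print the plot to draw it.
p
```

Any label you leave out of labs() keeps its default, which for x and y is the name of the mapped variable or computed statistic.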

Observations since 1975

ufos %>% 
  filter(
    year >= 1975
  ) %>% 
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations since 1975",
    caption = "Theme: theme_clean"
  )

Hint: year is currently being read as numerical. How can we use that to our advantage when we filter it?
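
Because year is numeric, comparison operators such as >= work on it directly. The sketch below shows this on a small made-up tibble (toy and its values are assumptions for illustration), along with how filter() treats several conditions.

```r
library(dplyr)

# Toy data (made up for illustration).
toy <- tibble::tibble(
  year  = c(1949, 1974, 1975, 1999, 2004),
  state = c("tx", "co", "co", "hi", "co")
)

# Numeric columns can be compared with >=, <, == and so on.
recent <- toy %>% filter(year >= 1975)

# Conditions separated by commas must all be true (logical AND).
recent_co <- toy %>% filter(year >= 1975, state == "co")
```

Here recent keeps the rows from 1975 onward, and recent_co keeps only those of them where state is "co".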

Observations between 1975 and 2000

ufos %>% 
  filter(
    between(year, 1975, 2000)
  ) %>% 
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations between 1975 and 2000",
    caption = "Theme: theme_clean"
  )

Hint: ?between()
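
One detail worth knowing is that between() includes both endpoints. A minimal sketch on a made-up vector of years:

```r
library(dplyr)

# Made-up years for illustration.
years <- c(1974, 1975, 1990, 2000, 2001)

# between() is inclusive on both ends.
in_range <- between(years, 1975, 2000)

# The same test written as two comparisons.
same_test <- years >= 1975 & years <= 2000

identical(in_range, same_test)
```

So filtering with between(year, 1975, 2000) keeps sightings from 1975 and from 2000 as well as everything in between.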

Observations since 1975 for Colorado (co)

ufos %>% 
  filter(
    year >= 1975,
    state == "co"
  ) %>% 
  ggplot(aes(year)) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations since 1975 for Colorado (co)",
    caption = "Theme: theme_clean"
  )

By Month

ufos %>% 
  ggplot(aes(as.factor(month))) +
  geom_bar(fill = "steelblue") +
  theme_clean() +
  labs(
    subtitle = "Observations by month",
    caption = "Theme: theme_clean"
  )

By Month between 2000 and 2003

ufos %>% 
  filter(
    between(year, 2000, 2003)
  ) %>% 
  ggplot(aes(as.factor(month))) +
  geom_bar(aes(fill = as.factor(year)), position = "dodge") +
  theme_fivethirtyeight() +
  labs(
    subtitle = "Observations by Month and year between 2000 and 2003",
    caption = "Theme: theme_fivethirtyeight
    Fill: scale_fill_fivethirtyeight()"
  ) +
  scale_fill_fivethirtyeight()

State and Shape

Another common plot, which we tried out in the last graph, shows two variables colored by a third. The following plots involve a lot of counting and filtering. They also use geom_barh() from the ggstance package.
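
The counting and ordering steps can be sketched on toy data before trying them on ufos. Everything in this example is made up for illustration, and it deliberately swaps in geom_col() with coord_flip() from ggplot2 as a stand-in for geom_barh(), so that it runs without ggstance.

```r
library(dplyr)
library(ggplot2)

# Toy data (made up for illustration): one row per sighting.
toy <- tibble::tibble(state = c("tx", "co", "co", "co", "hi", "hi"))

# count() returns one row per state, with the tally in a column named n.
state_counts <- toy %>% count(state)

# reorder() turns state into a factor whose levels are sorted by n,
# so bars are drawn from the smallest count to the largest.
ordered_states <- levels(reorder(state_counts$state, state_counts$n))

# Stand-in for geom_barh(): vertical bars flipped with coord_flip().
p <- ggplot(state_counts, aes(reorder(state, n), n)) +
  geom_col() +
  coord_flip()
```

Without reorder(), the states would be drawn in alphabetical order, which makes the chart harder to read.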

States with more than 1K UFO sightings

ufos %>% 
  count(state) %>% 
  filter(n > 1000) %>% 
  ggplot(aes(n, reorder(state, n))) +
  geom_barh(stat = "identity", 
            fill = "indianred4",
            color = "black") +
  labs(
    subtitle = "States with more than 1K UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs()

States with less than 500 UFO sightings

ufos %>% 
  count(state) %>% 
  filter(n < 500) %>% 
  ggplot(aes(n, reorder(state, n))) +
  geom_barh(stat = "identity",
            fill = "indianred4",
            color = "black") +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs()

Create a useful data set

For the next four graphs, we want to focus on the shapes of UFOs. However, since there are a lot of shapes, we will only look at the five most common ones, leaving out the catch-all categories unknown and other.

ufos %>%  count(shape) %>% arrange(desc(n))
## # A tibble: 29 x 2
##    shape         n
##    <chr>     <int>
##  1 light     13473
##  2 triangle   6549
##  3 circle     6118
##  4 fireball   5148
##  5 unknown    4567
##  6 other      4466
##  7 sphere     4347
##  8 disk       4121
##  9 oval       3032
## 10 formation  1990
## # ... with 19 more rows

To make the data set, we filter so that we only have observations with the shapes light, triangle, circle, fireball and sphere. We also add a column, n, that contains the number of remaining observations for each state.

ufos_shapes <- ufos %>%  
  filter(  shape == "light" | 
    shape == "triangle" | 
    shape == "circle" |
    shape == "fireball" |
    shape == "sphere") %>% 
  group_by(state) %>% 
  mutate(n = n()) %>% 
  ungroup()
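
The same two steps can be tried on a small made-up tibble. This is only a sketch: toy, keep and the other object names are assumptions for illustration, and %in% is shown as a shorter alternative to the chain of == comparisons, not as the code used to build ufos_shapes.

```r
library(dplyr)

# Toy data (made up for illustration).
toy <- tibble::tibble(
  state = c("co", "co", "tx", "tx", "tx", "hi"),
  shape = c("light", "disk", "circle", "light", "oval", "sphere")
)

keep <- c("light", "triangle", "circle", "fireball", "sphere")

# %in% is a shorter way to write several == tests joined by |.
with_in <- toy %>% filter(shape %in% keep)
with_or <- toy %>% filter(shape == "light" | shape == "triangle" |
                            shape == "circle" | shape == "fireball" |
                            shape == "sphere")

# group_by() + mutate(n = n()) keeps every row and adds a column n holding
# the number of remaining rows for that row's state.
toy_shapes <- with_in %>%
  group_by(state) %>%
  mutate(n = n()) %>%
  ungroup()
```

Note that n is counted after the filter, so it reflects only the sightings with one of the kept shapes.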

Number of Shapes in states with more than 1K UFO sightings

ufos_shapes %>%  
  filter(n > 1000) %>% 
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with more than 1K UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs() +
  scale_fill_gdocs()+
  coord_flip()

Number of Shapes in states with less than 500 UFO sightings

ufos_shapes %>% 
  filter(n < 500) %>% 
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_gdocs"
  ) +
  theme_gdocs() +
  scale_fill_gdocs()+
  coord_flip()

Colored by shapes

ufos_shapes %>% 
  filter(  shape == "light" | 
    shape == "triangle" | 
    shape == "circle" ) %>% 
  filter(n < 500) %>% 
  ggplot(aes(reorder(state, n))) +
  geom_bar(aes(fill = shape)) +
  labs(
    subtitle = "States with less than 500 UFO sightings",
    caption = "Theme: theme_tufte",
    x = "State",
    y = "Observations"
  ) +
  theme_tufte() +
  theme(legend.position = "none") +
  facet_wrap(~shape) +
  scale_fill_brewer(type = "qual") +
  coord_flip()

Hint: ?facet_wrap(); theme(legend.position = "none") removes the legend (see https://www.datanovia.com/en/blog/how-to-remove-legend-from-a-ggplot/#ggplot-with-no-legend)

Calculations

Another type of graph you will want to make is one that plots calculated values. A good way to compute those values is with summarise():

Average time in minutes per shape

ufos_shapes %>% 
  group_by(shape) %>% 
  summarise(average_min = mean(duration..seconds./60)) %>% 
  ungroup() %>%
  ggplot(aes(shape, average_min)) +
  geom_bar(fill = "#55752f",
           color ="grey9",
           stat = "identity") +
  theme_pander() +
  labs(
    title = "Average minutes per shape",
    caption = "Calculated with duration..seconds. using ufos_shapes
    theme: theme_pander
    fill: #55752f
    color: grey9"
  )

Hint: Use summarise and mean
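
To check your understanding of the grouping step, here is a minimal sketch on made-up durations. The column name duration_seconds is a stand-in chosen for this example; the real column is duration..seconds.

```r
library(dplyr)

# Toy data (made up for illustration).
toy <- tibble::tibble(
  shape = c("light", "light", "circle", "circle"),
  duration_seconds = c(60, 180, 600, 1200)
)

# One row per shape, holding the mean duration converted to minutes.
avg <- toy %>%
  group_by(shape) %>%
  summarise(average_min = mean(duration_seconds / 60))

avg
```

If a column contains missing values, mean() returns NA unless you pass na.rm = TRUE, so that is worth checking when an average looks wrong.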

Appendix

Congratulations on finishing the tutorial! Now see if you can think up and answer some of your own questions!

Download R

Did you love R? You can download it onto your computer for unlimited and offline use!